Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

where $p(j(l) \to i(l), q_i(l))$ is the transition probability between the aligned nucleotides $j(l)$ and $i(l)$, given the associated quality score $q_i(l)$, e.g., $p(\mathrm{T} \to \mathrm{G}, 40)$.

The divisive partitioning algorithm begins with all amplicon reads in a single partition. The error rate is then used to model the number of observed reads of each unique sequence and to compute the p-value for the hypothesis that this number is consistent with the error model. According to the DADA2 model, for a unique sequence $i$ with abundance $a_i$ in partition $j$ containing $n_j$ reads, the abundance p-value is given as

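$$p_A(j \to i) = \frac{1}{1 - \rho_{\mathrm{pois}}(n_j \lambda_{ji}, 0)} \sum_{a = a_i}^{\infty} \rho_{\mathrm{pois}}(n_j \lambda_{ji}, a) \qquad (7.2)$$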
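As a rough illustration of how such an abundance p-value can be evaluated, the following minimal sketch computes a Poisson upper tail, conditioned on observing at least one read. It assumes the expected number of error-derived reads is the product of the partition size and an error rate between the two sequences; the names `abundance_p_value`, `lam_ji`, and `tol` are hypothetical and are not part of the DADA2 software.

```python
import math


def poisson_pmf(mu, a):
    """Poisson probability of exactly `a` events when the mean is `mu` (> 0)."""
    return math.exp(-mu + a * math.log(mu) - math.lgamma(a + 1))


def abundance_p_value(a_i, n_j, lam_ji, tol=1e-16):
    """Probability of seeing at least a_i reads of sequence i, given that at
    least one was seen, if sequencing errors from partition j are the only
    source. Assumption: mean error-read count is n_j * lam_ji."""
    if a_i < 1:
        raise ValueError("a_i must be at least 1")
    mu = n_j * lam_ji
    if mu <= 0:
        raise ValueError("n_j * lam_ji must be positive")
    tail = 0.0
    a = a_i
    while True:
        term = poisson_pmf(mu, a)
        tail += term
        # Past the mean the terms shrink, so stop once they are negligible.
        if a > mu and term <= tol * tail:
            break
        a += 1
    # -expm1(-mu) equals 1 - exp(-mu), i.e. P(at least one read).
    return min(1.0, tail / -math.expm1(-mu))


# Hypothetical numbers: 1000 reads in the partition, error rate 1e-4.
p_singleton = abundance_p_value(1, 1000, 1e-4)
p_abundant = abundance_p_value(50, 1000, 1e-4)
```

A sequence observed once always gets a p-value of 1 under this conditioning, whereas a sequence observed far more often than the error rate can explain gets a p-value close to zero.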

These abundance p-values are used as the division criterion for iterative partitioning. A p-value threshold is specified; if the smallest abundance p-value falls below the threshold, a new partition is formed around that unique sequence, and other similar unique sequences are allowed to join it. The division continues iteratively until every unique sequence within each partition has an abundance p-value greater than the specified threshold.
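The iterative division described above can be sketched in a few lines of Python. This is a simplified toy, not the DADA2 implementation: it assumes a flat per-base substitution probability `eps` in place of quality-aware transition probabilities, starts from the most abundant sequence as the first partition center, and uses a hypothetical threshold `omega`. All function and parameter names are illustrative.

```python
import math


def hamming(s, t):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(s, t))


def error_rate(center, seq, eps):
    """Assumed flat error model: probability that `center` is read as `seq`."""
    d = hamming(center, seq)
    return (eps / 3) ** d * (1 - eps) ** (len(seq) - d)


def abundance_p_value(a_i, n_j, lam, tol=1e-16):
    """Poisson upper tail P(X >= a_i | X >= 1) with mean n_j * lam."""
    mu = n_j * lam
    if mu <= 0:
        return 0.0  # no error reads expected, so any observation is inconsistent
    tail, a = 0.0, a_i
    while True:
        term = math.exp(-mu + a * math.log(mu) - math.lgamma(a + 1))
        tail += term
        if a > mu and term <= tol * tail:
            break
        a += 1
    return min(1.0, tail / -math.expm1(-mu))


def divisive_partition(counts, omega=1e-4, eps=0.001):
    """Split unique sequences into partitions until no abundance p-value
    falls below omega. Returns (centers, assignment of sequence -> center)."""
    centers = [max(counts, key=counts.get)]  # start with one partition
    while True:
        # Each sequence joins the center most likely to have produced it.
        assign = {s: max(centers, key=lambda c: error_rate(c, s, eps))
                  for s in counts}
        sizes = {c: sum(n for s, n in counts.items() if assign[s] == c)
                 for c in centers}
        worst_seq, worst_p = None, math.inf
        for s, n in counts.items():
            if s in centers:
                continue
            c = assign[s]
            p = abundance_p_value(n, sizes[c], error_rate(c, s, eps))
            if p < worst_p:
                worst_seq, worst_p = s, p
        if worst_seq is None or worst_p >= omega:
            return centers, assign
        centers.append(worst_seq)  # too abundant to be an error: new partition


# Toy sample: one dominant sequence, one likely error, one likely real variant.
toy_counts = {"ACGTACGTAC": 1000, "ACGTACGTAA": 1, "TTGTACGTAC": 200}
toy_centers, toy_assign = divisive_partition(toy_counts)
```

In this toy sample the singleton stays in the partition of the dominant sequence, while the variant with 200 reads is far too abundant to be explained by errors and becomes the center of its own partition.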

The output of the divisive amplicon denoising algorithm is a collection of ASVs, which are exact sequences with defined statistical confidence. Because ASVs are exact sequences, generated without clustering or reference databases, they can be readily compared between studies that use the same target region. The DADA2 pipeline generates an ASV table that can be used for downstream analysis.

7.2.2.2.2 Deblur Denoising

Deblur [9] is a denoising method that uses error profiles of amplicons sequenced on the Illumina MiSeq and HiSeq platforms to infer error-free sequences. Unlike DADA2, Deblur operates on each sample independently. The Deblur algorithm begins by comparing the pairwise Hamming distances of all sequences within a sample to an upper-bound error profile. The unique sequences are sorted by abundance in descending order, from the most to the least abundant. For each read, a set of neighboring reads is defined based on a Hamming distance threshold. The expected number of erroneous reads, computed from an upper bound on the error probability, is then subtracted from the abundance of the neighboring reads. After subtraction, sequences with zero abundance are considered noise and are dropped from the list of valid sequences. Deblur can thus infer the correct sequences. However, it may fail to remove PCR chimeras, which are produced from aborted PCR cycles.
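The subtraction step can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the Deblur implementation: the error profile is a hypothetical list in which entry d-1 is an upper bound on the fraction of a sequence's reads expected to appear at Hamming distance d, all sequences are assumed to have equal length, and the function name `deblur_sketch` is illustrative.

```python
def hamming(s, t):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(s, t))


def deblur_sketch(counts, error_profile):
    """Subtract expected error reads from the neighbors of each sequence,
    processing sequences from most to least abundant within one sample.

    counts: dict mapping unique sequence -> read count.
    error_profile: hypothetical upper-bound error fractions, indexed by
    Hamming distance minus one.
    """
    remaining = {s: float(n) for s, n in counts.items()}
    for seq in sorted(counts, key=counts.get, reverse=True):
        if remaining[seq] <= 0:
            continue  # already judged to be noise; it contributes no errors
        for other in remaining:
            if other == seq:
                continue
            d = hamming(seq, other)
            if 1 <= d <= len(error_profile):
                # Remove the reads of `other` that errors in `seq` could explain.
                remaining[other] -= remaining[seq] * error_profile[d - 1]
    # Sequences left with no positive abundance are dropped as noise.
    return {s: n for s, n in remaining.items() if n > 0}


# Toy sample with a made-up error profile for distances 1 and 2.
sample = {"ACGTACGT": 1000, "ACGTACGA": 5, "ACGTACTT": 300}
denoised = deblur_sketch(sample, error_profile=[0.06, 0.02])
```

Here the sequence with five reads is fully explained by errors from its abundant neighbors and is dropped, while the sequence with 300 reads keeps a reduced but positive abundance and is retained.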

7.2.2.2.3 UNOISE2 Denoising

Unlike DADA2, UNOISE [10] denoising does not use quality scores; it uses a one-pass clustering strategy with only two parameters (α and β), both with preset values. A unique

read sequence (M) in a cluster is evaluated based on its Levenshtein distance (d) from